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Linearly solvable Markov Decision Process (LMDP) is a class of optimal control problem in 
which the Bellman's equation can be converted into a linear equation by an exponential 
transformation of the state value function (Todorov, 2009b). In an LMDR the optimal 
value function and the corresponding control policy are obtained by solving an eigenvalue 
problem in a discrete state space or an eigenfunction problem in a continuous state using 
the knowledge of the system dynamics and the action, state, and terminal cost functions. 
In this study, we evaluate the effectiveness of the LMDP framework in real robot control, in 
which the dynamics of the body and the environment have to be learned from experience. 
We first perform a simulation study of a pole swing-up task to evaluate the effect of 
the accuracy of the learned dynamics model on the derived the action policy. The result 
shows that a crude linear approximation of the non-linear dynamics can still allow solution 
of the task, despite with a higher total cost. We then perform real robot experiments of 
a battery-catching task using our Spring Dog mobile robot platform. The state is given by 
the position and the size of a battery in its camera view and two neck joint angles. The 
action is the velocities of two wheels, while the neck joints were controlled by a visual 
servo controller. We test linear and bilinear dynamic models in tasks with quadratic and 
Guassian state cost functions. In the quadratic cost task, the LMDP controller derived 
from a learned linear dynamics model performed equivalently with the optimal linear 
quadratic regulator (LQR). In the non-quadratic task, the LMDP controller with a linear 
dynamics model showed the best performance. The results demonstrate the usefulness 
of the LMDP framework in real robot control even when simple linear models are used for 
dynamics learning. 



Keywords: optimal control, linearly solvable Markov decision process, model-based reinforcement learning, model 
learning, robot navigation 



1. INTRODUCTION 

When we want to design an autonomous robot that can act 
optimally in its environment, the robot should solve non-linear 
optimization problems in continuous state and action spaces. If a 
precise model of the environment is available, then both optimal 
control (Todorov, 2006) and model-based reinforcement learn- 
ing (Barto and Sutton, 1998) give a computational framework to 
find an optimal control policy which minimizes cumulative costs 
(or maximizes cumulative rewards). In recent years, reinforce- 
ment learning algorithms have been applied to a wide range of 
neuroscience data (Niv, 2009) and model-based approaches have 
been receiving attention among researchers who are interested in 
decision making (Daw et al., 201 1; Doll et al., 2012). 

However, a drawback is the difficulty to find an optimal 
policy for continuous states and actions. Specifically, the non- 
linear Hamilton-Jacobi-Bellman (HJB) equation must be solved 
in order to derive an optimal policy. Dynamic programming 
solves the Bellman equation, which is a discrete-time version 
of the HJB equation, for discrete states and actions problems. 



Linear Quadratic Regulator (LQR) is one of the well-known 
optimal control methods for a linear dynamical system with 
a quadratic cost function. It can handle continuous states and 
actions, but it is not applicable to non-linear systems. 

Recently, a new framework of linearly solvable Markov deci- 
sion process (LMDP) has been proposed, in which a non-linear 
Bellman's equation for discrete and continuous systems is con- 
verted into a linear equation under certain assumptions on the 
action cost and the effect action on the state dynamics (Doya, 
2009; Todorov, 2009b). In fact, the basis idea of linearization 
of the HJB equation using logarithmic transformation has been 
shown in the book written by Flemming and Soner and its con- 
nection to risk sensitive control has been discussed in the field of 
control theory (Fleming and Soner, 2006). Their study has been 
receiving attention recently in the field of robotics and machine 
learning fields (Theodorou and Todorov, 2012) because there 
exist a number of interesting properties in the linearized Bellman 
equation (Todorov, 2009b). There are two major approaches in 
LMDP: the path integral approach (Kappen, 2005a,b) and the 
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desirability function approach (Todorov, 2009b). They are closely 
related and new theoretical findings are reported (Theodorou and 
Todorov, 2012), but there are some differences in practice. In 
the path integral approach, the linearized Bellman is computed 
along paths starting from given initial states using sampling meth- 
ods. The path integral approach has been successfully applied 
to learning of stochastic policies for robots with large degrees 
of freedom (Theodorou et al, 2010; Sugimoto and Morimoto, 
201 1; Stulp and Sigaud, 2012). The path integral approach is best 
suited for optimization around stereotyped motion trajectories. 
However, an additional learning is needed when a new initial 
state or a new goal state is given. In the value-based approach, 
an exponentially transformed state value function is defined as 
the desirability function and it is derived from the linearized 
Bellman's equation by solving an eigenvalue problem (Todorov, 
2007) or an eigenfunction problem (Todorov, 2009c; Zhong and 
Todorov, 2011). One of the benefits of the desirability function 
approach is its compositionality. Linearity of the Bellman equa- 
tion enables deriving an optimal policy for a composite task 
from previously learned optimal policies for basic tasks by lin- 
ear weighting by the desirability functions (da Silva et al., 2009; 
Todorov, 2009a). However, the desirability function approach has 
so far been tested only in simulation. In this study, we test the 
applicability of the desirability function approach to real robot 
control. 

In order to apply the LMDP framework to real robot applica- 
tions, the environmental dynamics should be estimated through 
the interaction with the environment. This paper proposes a 
method which integrates model learning with the LMDP frame- 
work and investigates how the accuracy of the learned model 
affects that of the desirability function, the corresponding policy, 
and the task performance. Although Burdelis and Ikeda proposed 
a similar approach for the system with discrete states and actions 
(Burdelis and Ikeda, 2011), it is not applicable to a continuous 
domain. We test the proposed method in two tasks. The first 
task is a well-known benchmark, the pole swing-up problem. 
We use linear and non-linear models for approximation of the 
environmental dynamics and investigate how the accuracy of the 
dynamics model affects the estimated desirability function and 
the corresponding policy. The second task is a visually guided 
navigation problem using our Spring Dog robot which has six 
degrees of freedom. The landmark with the LED is located in 
the environment and the Spring Dog should approach the land- 
mark. We compare linear and bilinear dynamics models with 
quadratic and Gaussian state cost functions. Experimental results 
showed that the LMDP framework with model learning is appli- 
cable to real robot learning even when simple dynamics models 
are used. 

2. MATERIALS AND METHODS 

2.1. LINEARLY SOLVABLE MARKOV DECISION PROCESS 

At first, we show how a non-linear Bellman's equation can be 
made linear under the LMDP setting formulated by Todorov 
(2009b). Let X C R N * andU c R N » be the continuous state and 
continuous action spaces, where N x and N u are the dimension- 
ality of the spaces, respectively. At time t, the robot observes 
the environmental current state x(t) e X and executes action 



u(t) e U. Consequently, the robot receives an immediate cost 
c(x(t),u(t)) and the environment makes a state transition 
according to the following continuous-time stochastic differential 
equation, 

dx = a(x)dt + B(x)(udt + adco), (1) 

where w e M^" and a denote Brownian noise and a scaling 
parameter for the noise, respectively. a(x) describes the passive 
dynamics of the system while B(x) represents the input- gain 
matrix. Note that Equation (1) is generally non-linear with 
respect to the state x but linear with respect to the action u. 
It is convenient to represent Equation (1) in discrete time. By 
discretizing the time axis with step /z, we obtain the following 
transition probability, 

p Uk {x k+l \x k ) =Ar(x k + l \\L(x k ,u k )+x k ,hIl(x)), (2) 

where AT(x\[i, £) denotes a Gaussian distribution with mean |x 
and covariance matrix £ , and 



[i(x, u) = h(a(x) + B(x)u), 
H(x) = oB(x) T B(x), 



(3) 
(4) 



where [i(x, u) can be regarded as a deterministic state 
transition function. Note that x k = x(hk) and u k = u(hk). 
It should be noted that a state transition probability is 
defined as an uncontrolled probability when no control is 
applied (u = 0), and otherwise, it is called a controlled 
probability. 

A control policy or controller tz(u\x) is defined as a probability 
of selecting the action u at state x. When the goal of the robot is 
to find an optimal control policy tt* that can lead the robot to the 
desired state x g e X g c X, the objective function is formulated 
as minimization of the expected value of cumulative costs, 



v 7Z (x) = E 



^2 c ( x k> +g(*g) 



(5) 



where and c(x, u) and g(x), respectively denote the immediate 
and terminal cost. T g represents an arrival time. v n (x) is known 
as a cost-to-go or value function. The optimal value function is 
the minimal expected cumulative cost defined by 



v* (x) = min v 71 (x) . 



(6) 



It is known that the optimal value function satisfies the following 
Bellman's equation 

v*(x) = min (c(x, u) + E^ p u^ x) v*(x!)) (7) 

Since Equation (7) is non-linear, it is difficult to solve the opti- 
mal value function in general. However, the Bellman's equation 
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is simplified if it is assumed that the immediate cost function is 
represented by 

c(x, u) = hq(x) + KL(p u (.\x)\\p°(.\x)), (8) 

where q(x) is a non-negative state cost function and the second 
term on the right hand side of Equation (8) is a control cost 
given as the KL- divergence between controlled and uncontrolled 
probability distributions. 1 In this case, the non-linear Bellman's 
equation is converted to the following linear equation 

z(x) = exp(-hq(x))g[z](x) (9) 
z(x g ) = exp(-g(x g )), x g e X g , 

where z(x) is the desirability function defined by 

z(x) = exp(-v*0;)). (10) 

Hereafter, Equation (9) is called a linearized Bellman's equation. 
The operator Q shown on the right hand side of the linearized 
Bellman's Equation (9) is the integral operator given by 

G\fK*) = f P°(x'\x)f(x')dx'. (11) 

It should be noted that Equation (9) is always satisfied by the 
trivial solution (z(x) = 0 for all x) if no boundary conditions are 
introduced. 

2.2. LEARNING MODEL PARAMETERS 

In the LMDP framework, the system dynamics (Equation 1) are 
assumed to be known in advance. When they are unknown, 
estimation of the dynamics is required from samples collected 
by the passive dynamics. Many methods exist which can esti- 
mate the system dynamics (Nguyen-Tuong and Peters, 2011; 
Sigaud et al, 2011), we adopt a simple least squares method to 
estimate a(x) and B(x) with basis functions. Specifically, we esti- 
mate a deterministic state transition (Equation 3). It should be 
noted that the scale parameter of noise a is generally unknown, 
but it is determined by the experimenters here since it can 
be regarded as the parameter that controls exploration of the 
environment. 

Let us suppose that the deterministic state transition [i(x, u) 
is approximated by the linear function with Ny basis functions 
<Pi(x, w), 

[i(x, u; W) = W T y(x, u). (12) 

where W is a weight matrix and <p(x, u) is a vector con- 
sisting of basis functions. Suppose that the training sam- 
ples {x\, Mi, . . . , xn s , un s , xn s -\-i} are obtained by the passive 



^he Kullback-Leibler (KL) divergence measures the difference between two 
distributions. If two distributions are the same, the KL-divergence becomes 0. 
In the LMDP, the control cost is defined by how certain control u affects on 
state transition probability. 



dynamics. The objective function of model learning is given by 
the following sum- of- squares error function, 

E = l - J2 [ Ax k ~ W T q>(x k , u k )\ 2 , (13) 

k=l 

where Ax k = x k + i — x\. Setting = 0 yields 

W= (O T 0) _1 O T AX, (14) 

where AX is the matrix whose a row vector consisted of state 
transition in each sample Ax k and O is also the matrix whose 
a column vector consisted of the basis functions in each sample 
<p(*fc, iifc). The detail is as follow, 

AX = [Axi ■ ■ ■ Ax Ns f , 0 = [<p(*i, Mi) • • • y(x Ns , u Ns )] . 

2.3. LEARNING A DESIRABILITY FUNCTION 

The desirability function is approximated by 

N z 

z{x\ w, 0) = J2 w if( x > °i) = w T /(*> 0)> (15) 

i= 1 

where W{ is a weight, w is the weight vector [w\, . . . , w^ z ] T , 
f{x, 6;) is a basis function parameterized by 0;, and/(x; 0) is the 
vector consisting of basis functions [fix; 6i), . . . , fix; 0jv z )] t . 
We adopt an unnormalized Gaussian function as Todorov sug- 
gested (Todorov, 2009c): 

fix; 60 = ex p(~ (* m if s i (* m i)j » e i = [m, ^} 

(16) 

where ntj and S,- denote a center position and a precision matrix 
of the f-th basis function, respectively. One advantage of using the 
Gaussian function that the integral operator (Equation 11) can be 
calculated analytically as follows: 

G[fi\(x) = Wi\~ l2 ex p(-^ (y - m i) TH i (y - m 0) . ( 17 ) 

where yix) = x + jx(x, 0), fi = fix, 0 Z ) for brevity and 

Hi = Si - SiCVT l C T Si, Vi = I + C r SiC, 
C = oh 1/2 B. 

It should be noted that y, Hi, Vi, C are functions of x. 

The desirability function (Equation 15) should satisfy the lin- 
earized Bellman's equation (9). Therefore, in order to optimize w 
and 0 we can construct the following objective function for given 
collocation states [x\, . . . , x^ c }\ 

L e ) J Lew(-g(x g ))] 

(18) 
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where F(Q) and G(0) are N c x N z matrices and their (n, i) 
components are defined by 



[Fm] ni =fi(x n ), 

[G(6)]„i = exp(-hq(x„))g[fi](x„). 



(19) 
(20) 



The objective function (Equation 18) is a quadratic function with 
respect to w and a non-linear function with respect to 0. See 
Appendix A for optimization of w and 0. 

2.4. OPTIMAL CONTROL POLICY 

In the LMDP framework, the optimal control policy is given by 



F G[z](x) 



(21) 



Specifically, when the dynamics are represented in the form of 
the stochastic differential equation (1) and the basis function 
of the approximated desirability function is Gaussian, then the 
optimal control policy is represented by 



u <*> = 0 L ^ — rU( 
i= i Z k Li">kG[fk(x)] 

di(x) = V^C^Si (nti -x- ha(x)) . 



(22) 



See Todorov (2009c) in more detail. 
2.5. EXPERIMENT 

In this paper, we conduct two experiments to evaluate the LMDP 
framework with model learning. One is a swing- up pole task in 
simulation. The other is a visually- guided navigation task using a 
real robot. 

2.5.1. Swing-up pole 

To verify that an appropriate control policy can be derived based 
on estimated dynamics, we conducted a computer simulation of 
the swing- up pole task. In the simulation, the one side of pole 
was fixed and the pole could rotate in plane around the fixed 
point as shown in Figure 1. The goal was to swing the pole to an 
upward position and stop at this position. The state in this task 
consisted of the vertical angle # and the angular velocity the 
origin of the state space was set at the goal position. It should be 
noted that # was normalized to be in the range (— it, 7t] (rad) 
while # was bounded: # e [— 4tt, 4 it] (rad /s). The control 
input and noise affected the torque of the pole. Therefore, 
the pole is assumed to obey the following stochastic state 
equation, 

d$ = $dt (23) 
dh = (ray sin(fl) - k&j dt + udt + odd), 

where /, ra, g, and k denote the length of the pole, mass, 
gravitational acceleration and coefficient of friction, respectively. 




The above state equation 
Equation (1) as follows; 



represented in the form of 



a(x) = 



raf sin(tf) — k$ 



f, B=[0,1]\ 



It should be note that the passive dynamics a(x) is a non -linear 
vector function of x while B is a constant vector. In this sim- 
ulation, the physical parameters were 1=1 (m), m = 1 (kg), 
g = 9.8 (kg/s 2 ) and k = 0.05 (kg m 2 /s). The state equation was 
discretized in time with a time step of h = 10 (ms) and the noise 
scale was set to a = 4. The state cost was defined so that it was 
zero at the goal state, using the following unnormalized Gaussian 
function, 



q(x) = (l- exp (* T £-st*)) 



(24) 



where diag(E cost ) = [0.1, 1.6]. 

As written in section 2.2, the weight matrix was estimated 
by Equation (14). In the sample acquisition phase we repeated 
simulations sufficiently, each simulation started from differ- 
ent initial states to avoid unevenly distributed samples. As a 
result, N = 1000 samples were extracted randomly as a training 
data set. 

In this simulation, we prepared two types of basis functions 
(p(x, w), as shown in Table 1, for approximation of the environ- 
mental dynamics. The first was a simple linear model with respect 
to x and u while the second model added the normalized radial 
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Table 1 | Basis functions used in the swing-up pole simulation. 



<P(*, u) 


Linear model 


[x T u J ] J 


Linear-NRBF model 


[x J \|n(x) ty 2 (x) ■■■ ^m(x) m t ] T 


basis functions (NRBF) 


u) to the linear model, 



exp(-i(x- [LifH^ix- [Li)) 

Mx) = ^7— ^ ~\' (25) 

J2 k ew{-\(x-V>k) T ^^ k (x-[L k )) 

The centers, |x z , of the basis functions, tyj(x, «), were determined 
by K-means clustering among the states of the training data. The 
co variance matrices were determined experimentally and set 
to diag(E^.) = [it /A, 7i]. In the linear-NRBF model, = 25 
basis functions were used. 

The set of collocation states {x\, . . . , xn s ], which were 
required to optimize the parameters of the desirability func- 
tion, were uniformly distributed in the state space. The centers 
nti of the basis functions fj(x) were initialized so as to dis- 
tribute them uniformly in the state space. On the other hand, 
the covariance matrices Sj were determined empirically and set to 
diag([16, 1]). The optimal control policy u*(x) was derived from 
Equation (22). 

2.5.2. Visually-guided navigation task 

To evaluate the performance of the optimal control policy derived 
from the estimated dynamics and the desirability function, we 
conducted a visual navigation task using a wheel type robot called 
the Spring Dog. Figure 2 shows the Spring Dog and the battery 
pack in the experimental field. The Spring Dog has six degrees 
of freedom: two fore legs, two rear wheels, and a pan-tilt cam- 
era head. There are several sensors such as a 3D accelerometer, 
a combined 3D gyroscope, and a USB camera mounted on the 
head, and so on. Three-color LED is attached to the top of the 
battery pack. 

Figure 3 shows the control diagram, where three control 
policies were implemented in this experiment. The first one 
was a visual servoing controller, which controlled the cam- 
era head so as to keep tracking the battery pack continuously. 
The second one was a navigation controller using the two 
rear wheels, this was optimized by the LMDP framework. In 
other words, the navigation controller controlled the left and 
right wheels in order to move around in the environment. The 
desired velocities of left and right wheels correspond to control 
input u in Equation (1). The last one was a seeking behav- 
ior, in which the Spring Dog explored the environment to find 
the battery pack when the robot lost track of it. The navi- 
gation controller learned by the LMDP framework while the 
visual servoing and searching controllers were designed by the 
experimenters. 

To realize a visually- guided navigation task, image binarization 
was applied to a captured image in order to separate the battery 
pack with the green LED from background. Some image features 
were calculated as shown in Figure 4. The state space consists of 




FIGURE 2 | Spring Dog, wheel typed robot and the battery pack. 



six variables described below: the center position of the battery 
pack (extracted pixels) in the image plane (x cx , x cy ), average of 
absolute values around the center in horizontal and vertical axes 
of the extracted pixels (x ax , x ay ), and the current joint angles of 
the neck controlled by the visual servoing controller. The state 
and action were summarized as follows: 

T T 
X — [x C x, Xcy-> Xaxi Xay-> ^tilt» ^pan] •> W = [wieft? bright ] • 

It should be noted that each value was scaled as follow, 

— 1 — %cx 5 %cy i ^tilt j ^pan — 1 ? 
0 <X ax ,X ay < 1, 

-1 < "left, Wright < 1- 

The desired state, x g , was set to comprise of both a posture 
and location which allowed the Spring Dog to successfully cap- 
ture of the battery. The view feed from the USB camera allowed 
recognition of the desired proximity and posture, as shown in 
Figure 2. 

Two types of state dependent cost functions q\ (x) and qi (x) 
were considered in the experiment. Each cost function was 
defined to be zero at the goal state as follows, 

qi (x) = a(x- x g ) T Y,-\ t (x - x g ) (26) 
q 2 (x) = a (l - exp(- (x - x g ) T (x - x g )Jj , (27) 

where a was a scaling constant. 
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image data _ . „ . .. . _ true , 
J Is the battery visible? | 



r — r 

lature 



Image processing \x ry ,x cy ,x nx ,x ay ^ 

Extract green color | f * P Controller: Tl(x) 

and calculate image featu 



U 



false 



Visual servoing 

Control the head 
to keep track the battery 



x till' x pan 



Seeking behavior 

Rotate around its center 
until the battery observed 



U 



FIGURE 3 | Control diagram in the Spring Dog. 



Table 2 | Basis functions used in the robot experiment. 




FIGURE 4 | Image binarization and image features. (A) Original captured 
image. (B) Binarized image. 



cp(x, u) 



Linear model 
Bilinear model 



[(*-%) T «T 

[(X-Xgf U\ e ft(X ~ Xgf U ught (X ~ Xgf U J ] 



Procedure 1 | Setting initial position of the centers of the basis 
functions, M init . 

Input: The date set of state, *D X . 

Output: The set of initial center positions, M /n/f 

Minit <- 0 
while T> x ^ 0 do 

x — ChooseSample(X> x ) 

T> x <- - {x} 

if V/ fj(x; mi) < x or M /n/f = 0 then 

^init <- M, n/ fUW 
end if 
end while 
return M /n/f 



Next we explain the procedure for estimation of visual-motor 
dynamics. At first, the Spring Dog moved around using the fixed 
stochastic policy and obtained data. In the experiment, the con- 
trol cycle was required to keep h = 300 =b 60 (ms), but it was 
sometimes violated interference from other processes. To deal 
with this problem in sampling, we rejected the corresponding 
data. In addition, If the target became invisible, or the tilt or pan 
angle reached by setting, its limitation, the corresponding data 
was rejected from samples also. As a result, we obtained the data 

uj]\.. 



After normalizing this 



set, V — i u Nt> j 

data set, the environmental dynamics were estimated as described 
in section 2.2. 

In this experiment, we used two types of basis functions 
(p(x, m), as shown in Table 2, to estimate visual-motor dynam- 
ics. If we apply the linear model for visual-motor dynamics and 
use a quadratic state cost function in Equation (26), the problem 
setting is identical to that of Linear Quadratic Regulator (LQR). 
Therefore, we can confirm that the LMDP finds the same optimal 
policy as LQR. 



As well as the swing- up pole task, collocation states 
{jci, . . . , xn s } were uniformly distributed in the state space, and 
the covariance matrices S z - were determined by hand. Moreover, 
only centers of basis functions of desirability were updated and 
covariance matrices were fixed in the experiment. The optimal 
control policy u*(x) was derived from Equation (22). The ini- 
tial position of the center ntj in each basis function fi(x) was 
taken from the data set of state, T> x = [x\, ... , x^,], which was 
extracted state data from the data set V. However, it was not 
appropriate for the computational resources of the real robot to 
use all of the data. For this reason, the set of initial positions of 
the centers of the basis functions, M zmt = [mi, ... , mjv z ]> were 
chosen from the data set of state V x following Procedure 1. As a 
result, at least one of the basis functions could return the value, 
which was over the threshold, t, for every samples. 

As already explained, to verify that LMDP can be apply to non- 
linear state transition system and non-quadratic cost function 
and the obtained controller performs optimal. In the experiment 
we tested the following four conditions: 
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1. Linear model + quadratic state cost. 

2. Bilinear model + quadratic state cost. 

3. Linear model + Gaussian based state cost (non-quadratic). 

4. Bilinear model + Gaussian based state cost (non-quadratic). 

Note that LQR can be applicable in the first condition. Therefore, 
LQR was also implemented to compare the result of the LMDP 
framework to the ground truth obtained from LQR in the first 
condition. 

3. RESULTS 

3.1. COMPUTER SIMULATION 

As described in the section 2.5.1, we used the linear and the 
linear-NRBF models to approximate the environmental dynam- 
ics of the swing-up pole. To evaluate the accuracy of estima- 
tion using these models, we measured the estimation errors. 
We extracted N = 500 samples randomly as a test data set and 
then calculated the estimates of the deterministic state transi- 
tion [i(x, u\ W) when two models were applied, respectively. 
After that, we computed the mean squared error (MSE) of each 
component, 

KT 

1 



MSE of the k-th component = — (Ax^ w — w^cpfe, u n )) 



(28) 



where w\ denotes the elements of k-th row in the weight 
matrix W. 

Figure 5 shows the MSE of the angle and angular velocity 
component. According to the this result, the estimation of the 
angle component was quite accurate in both models because it 
was deterministic transition. On the other hand, the estimation 
of the angular velocity component was inaccurate as compared 
with the angle component since it was a stochastic state tran- 
sition. According to Equations 2, 3 and the parameter setting 
of the time step, h = 10 (ms), the noise scale, a = 4, and B = 



| Linear model 

| Linear-NRBF model 




Angle [rad] 



Angular velocity [rad/s] 



FIGURE 5 | Mean squared error of the joint angle and angular velocity. 

Each error bar represents the standard deviation. 



[0, 1] T , the co variance matrix was derived diag (£) = [0, 0.04]. 
The covariance matrix affects to the MSE by square, the MSE 
between real deterministic state transition and an observed tem- 
poral state transition should be at least 1.6 x 10 -3 . The MSE 
of angular velocity component in the linear-NRBF model was 
also 1.6 x 10 -3 , it was suggested that most of the error was 
caused by noise. Consequently, This result suggested that the 
environmental dynamics were accurately approximated by the 
linear-NRBF model. The estimated input gain matrices were 
given by 



Bin 



o.ooool 

0.9965 



Bi 



inear-NRBF 



ro.ooool 
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they were very close to the true matrix B = [0, 1] T . 

The desirability function was optimized using the estimated 
dynamics and the control policy derived from the obtained desir- 
ability function. Figure 6 displays the results where the left panels 
show the desirability function z(x) and the right panels show the 
learned policy u* (x) . The black line in the right panels shows a 
typical trajectory of learned behaviors starting from x = [it, 0] T . 
The top panels of Figure 6 display the results using the true 
dynamics. It should be noted that the desirability function is 
discontinuous around the central diagonal band since this sys- 
tem is under- actuated. Simulation results using the linear and 
linear-NRBF models are shown in the middle and bottom pan- 
els of Figure 6, respectively. As compared with the result based on 
the true dynamics, both of the linear and linear-NRBF models 
could approximate the desirability function. However, the pol- 
icy obtained by the linear model was worse than that by the 
linear-NRBF model. 

To evaluate the performance in more detail, we measured the 
cumulative costs corresponding to each of the obtained policies. 
In this test simulation, the initial state was set to x = [it, 0] T 
which corresponds to the bottom position. Figure 7 shows mean 
cumulative costs of 50 episodes, each episode was terminated 
when the pole arrived at the goal state or the duration reached 
was over 20 (s) (2000 step). Note that the immediate cost in each 

step was calculated by c(x, u) = h(q(x) + II w ll 2 )- 

Figure 7 compares the cumulative costs among the three poli- 
cies. Not surprisingly, the control policy derived from the true 
dynamics achieved the best performance. It should be noted that 
the control policy based on the dynamics estimated with the 
linear-NRBF model produced a comparable performance, and it 
was better than the performance of the linear model. As discussed 
in the previous section, the linear-NRBF model gave more correct 
estimation than the linear model. Consequently, these results sug- 
gest that we can obtain the better control policy by forming more 
accurate estimates. 

3.2. REAL ROBOT EXPERIMENT 

As described in section 2.5.2, we used the linear and bilinear 
models for environmental dynamics approximation. After the 
data acquisition phase, we obtained Nx> = 9509 samples and we 
extracted N = 2500 samples for a test data set, the rest of samples 
were used as a training data set. As well as the swing- up the pole 
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FIGURE 6 | Results of the swing-up pole task. z(x) is on the left, u*(x) on the right, true dynamics at the top, linear model in the middle, and linear and 
NRBF model at the bottom. The black line shows a typical trajectory. 



task, we obtained weight matrix using Equation (14) and then 
calculated MSE in the test data set to evaluate the accuracy of 
estimation. 

Figure 8 shows the result. There was no significant difference 
between linear and bilinilear models. It suggests these models 
have almost the same quality for approximating environmen- 
tal dynamics. Comparing to other components, x cx and x pan 
derive larger MSE in both model. The reason is these compo- 
nents change more significantly than other components. During 
the sample acquisition phase, more movement in the rotatory 
direction occurred than in the translation direction. As a result, 
the variation of x cx , which was caused by movement of rotatory 
direction, was large and the variation of x pan also became large 
due to visual servoing to keep track of the battery in center of 
visual field. 



Figure 9 shows one typical example of the obtained desirabil- 
ity function and the control policy when the cost function is 
quadratic and the visual- dynamics is estimated using the linear 
and bilinear models. The upper row corresponds to the LQR's 
case and the middle and bottom rows correspond to the LMDP 
trained with the proposed method using linear and bilinear mod- 
els, respectively. In all figures, the horizontal and the vertical axes 
denote the pan and tile angle of the neck joint, respectively; the 
rest of the state components are set to the desired state. Blue dots 
plotted on middle and lower rows are m z , the center positions of 
the basis functions for approximating the desirability function. 
Although the peak of the desirability functions trained with the 
proposed method is broader than that of the desirability of LQR 
due to function approximation, obtained controllers show almost 
same tendency. 
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FIGURE 7 | Total costs collected by the obtained control policy. Each 
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FIGURE 8 | Mean squared error of each state variable. Each error bar 
represents the standard deviation. 



Next, to evaluate performance of obtained controllers, we 
tested the approaching behavior under the each controller. In 
the test, the initial position of the robot was set at a distance 
of 1.5 (m) left the target. The initial direction for each episode 
was selected randomly a set of three directions; target is placed 
directly in front of the robot, at a 15° offset to the right of 
the robot's line of motion or at a 15° offset to the left side, 
as shown in Figure 10. Figure 11 shows the mean total costs of 
30 episodes, the maximum period in one learning episode was 
15 (s) (50 steps). For comparison, Figure 11 shows only quadratic 
cost function case. Note that the immediate cost in each step 

was regarded as c(x, u) = h(qi(x) + ^ ll M ll 2 )> an d was ignored 
when the target is not visible in the visual field. 

Comparing the total cost among the three controllers using 
quadratic cost as shown in Figure 11, the controller using the 
linear model resulted in the almost same performance to the 
result using LQR controller. This result is reasonable because 



these controllers solve the same problem. The trajectories were 
very similar shown in Figure 12. 

On the other hand, the controller using a bilinear model 
acquired marginally worse result as compared with the other 
controllers. One possible reason is that over fitting occurred in 
bilinear model. 

In comparing performance among all obtained controllers, we 
cannot use the total cost because of the difference on state costs. 
For this reasons we calculated L- 1 norm 2 between the current state 
and the goal state as quantity of controller performance which 
can be comparable in all controllers. Figure 13 shows this. All of 
controllers brought the Spring Dog to almost the goal state in 10 s. 
Particularly, the controllers using the non- quadratic cost func- 
tion brought the Spring Dog closer to the battery pack than other 
controllers. The reason can be considered that the non-quadratic 
cost function gave a lower cost in more narrow region than the 
quadratic cost. 

4. DISCUSSION 

Although it has been reported that the framework of LMDP 
can find an optimal policy faster than conventional reinforce- 
ment learning algorithms, the LMDP requires the knowledge of 
state transition probabilities in advance. In this paper, we demon- 
strated that the LMDP framework can be successfully used with 
the environmental dynamics estimated by model learning. In 
addition, our study is the first attempt to apply the LMDP frame- 
work to real robot tasks. Our method can be regarded as a of 
model-based reinforcement learning algorithms. Although many 
model-based methods includes model learning (Deisenroth et al., 
2009; Hester et al., 2010) have been proposed in this field, they 
compute an optimal state value function which is a solution of 
a non-linear Bellman's equation. Experimental results show that 
our method is applicable to real robot behavior learning which is 
generally stochastic and including non-linear state transition. In 
our proposed method, a cost function is not estimated. However, 
it is possible to extend to estimate a cost function as well as sys- 
tem dynamics simultaneously, because it is usually formulated as 
a standard supervised learning problem. In addition, it is not so 
difficult to assume that a cost function is given in the real robot 
application, because the robot usually compute the reward by 
itself in many application. 

In the swing-up pole task, the linear and linear-NRBF models 
were tested to approximate the pole dynamics. The policy derived 
from the linear model achieved the task of bringing the pole to 
the desired position even though it cannot represent the dynam- 
ics correctly. In the visually- guided navigation task, we compared 
the desirability function and control policy of LMDP with those 
of LQR if the environmental dynamics and the cost function 
were approximated by the linear model and the quadratic func- 
tion, respectively. In this setting, the optimal state value function 
and the control policy were calculated analytically by LQR, and 
therefore, we obtained the optimal desirability function. The 
obtained desirability function and control policy were not exactly 
the same as those of LQR. However, we confirmed that the 



2 The L-l norm of a vector x = (x\ , . . . , x n ) T is the sum of the absolute value 
of the coordinate of x, computed by ||x|| i = J];l x z'l- 
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FIGURE 9 | Results of the robot navigation task. LQR with the linear model is at the top, LMDP with the linear model in the middle, LMDP with the bilinear 
model at the bottom, z(x) on the left, u* efx (x) on the center, i/* ght O) on the right. Black dots represent the centers of the basis functions <pO, u). 




FIGURE 10 | Initial position of the Spring Dog and battery in the test 
phase. Three possible positions of the battery pack are considered. 



performance using the obtained control policy was comparable to 
the performance using LQR. Both models prepared in this exper- 
iment failed to approximate a part of state transition such as oCqx 
and Xp an . This means that the Spring Dog could not predict the 
future position of the battery pack precisely when turned left or 
right. Nevertheless, the robot could approach the battery pack 
appropriately. This result suggests that LMDP with model learn- 
ing is promising even though the estimated model was not so 
accurate. Fortunately, the control policy which brings the robot 
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FIGURE 11 | Average of total cost using the quadratic state cost 
function. Each error bar represents the standard deviation. 



to the desired position can be obtained with simple linear model 
in both experiments. We plan to evaluate the proposed method 
to non-linear control tasks such as learning walking and running 
behaviors. 
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FIGURE 12 | Trajectories of the pan angle x tMt and the immediate state 
cost under the quadratic state cost. 
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FIGURE 13 | Trajectories of the L-1 norm between the current and 
goal states. 



As discussed in section 3, the quality of obtained control policy 
depends on the accuracy of the estimated environmental model. 
For instance, the bilinear model used in the robot experiment did 
not improve the approximation accuracy, as shown in Figure 8, 



even though its computational complexity is a rather than the 
linear model. In addition, a part of the conditional mean |x(x, u) 
was estimated by the least squares method in the current imple- 
mentation but it would be more informative to estimate the state 
transition probability distribution p Uk (xk+i \x^) itself. There exist 
several methods for estimating a probability distribution from 
samples. For example, Gaussian process is widely used to estimate 
environmental dynamics (Deisenroth et al., 2009; Deisenroth and 
Rasmussen, 2011). Sugiyama et al. (2010) proposed the method 
to estimate a conditional density distribution efficiently in the 
manner of density ratio estimation and applied it to state tran- 
sition estimation in simulated environments. One advantage of 
their method is that it can estimate a multi-modal distribution by 
the least squares method. In this case, it is no longer tractable ana- 
lytically to compute the integral operator even if Gaussian basis 
functions are used for approximation, and it should be replaced 
by the Monte Carlo estimates. Integration of sophisticated model 
learning methods with the LMDP framework is our future work. 

The other extension is to develop a model free approach 
of learning desirability functions, in which the environmen- 
tal dynamics is not estimated explicitly. Z learning is a typical 
model- free reinforcement learning method which can learn a 
desirability function for discrete states and actions, and it was 
shown that the learning speed of Z learning was faster than 
that of Q-learning in grid-world maze problems (Todorov, 2007, 
2009b). Application of least squares-based reinforcement learn- 
ing algorithms (Boyan, 2002; Lagoudakis and Parr, 2003) is 
promising direction. However, in the continuous state case, as 
mentioned in section 2.1, the optimality equation derive a trivial 
solution without boundary conditions. In addition, the desir- 
ability function should satisfy the inequality 0 < z(x) < 1 in 
order to recover a correct value function by v{x) = — log(z(X)). 
Furthermore, values of the desirability "function tend to be 
too small" because of the exponential transformation. For 
these reasons boundary conditions must be carefully considered. 
Consequently, the constrained optimization methods should be 
solved to find the optimal desirability function while learning 
of the value function is considered as unconstrained optimiza- 
tion. For the extension of model- free learning, this issue have to 
be solved. 
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APPENDIX 

AOPTIMIZATION OF FUNCTION APPROXIMATION PARAMETERS 

When the cost function is non-negative, the value function v(x) 
is also non-negative, and therefore, the inequality 0 < z(x) < 
1 holds at any x by the definition of the desirability func- 
tion (Equation 10). In order to satisfy this inequality, the con- 
straint Wj > 0 for all i is required since we assume that the 
basis function is a non-normalized Gaussian function. This con- 
strained optimization on w is efficiently solved by the following 
quadratic programming 

mine, s.t. Wj > 0, Vi. (29) 
w 

To optimize 0, it is possible to apply the Levenberg- 
Marquardt algorithm to minimize the square error (Equation 18). 
However, it was reported that the desirability function become 



z(x n ; w, 0) & 0 during the minimization process because the cen- 
ter position of the basis functions m z move away from collocation 
states x n (Todorov, 2009b). To avoid the trivial solution z(x) = 0, 
the following constraint is introduced, 

i t f(0)w = ^2 w ' °) = const - ( 3 °) 

n=l 

This constrained problem is optimized by the Levenberg- 
Marquardt algorithm. When we define J = dr/dQ and g = 
3(l T J F(0)w)/30, then the objective function is given by 

min -8 T (/ T / + y/)8 + 8 T / T r s.t. g T 8 = 0, (31) 
8 2 

where 8 and y denote the gradient direction of the update rule 
and the parameter between 0 and 1, respectively. This is solved by 
the Lagrange multiplier methods. 
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